Defeating the Homogeneity Assumption

Authors

  • A. De Roeck
  • A. Sarkar
  • P. H. Garthwaite
Abstract

The statistical NLP and IR literatures tend to make a “homogeneity assumption” about the distribution of terms, either by adopting a “bag of words” model, or in their treatment of function words. In this paper we develop a notion of homogeneity detection to a level of statistical significance, and conduct a series of experiments on different datasets, to show that the homogeneity assumption does not generally hold. We show that it also does not hold for function words. Importantly, datasets and document collections are found not to be neutral with respect to the property of homogeneity, even for function words. The homogeneity assumption is defeated substantially even for collections known to contain similar documents, and more drastically for diverse collections. We conclude that it is statistically unreasonable to assume that word distribution within a corpus is homogeneous. Because homogeneity findings differ substantially between different collections, we argue for the use of homogeneity measures as a means of profiling datasets.
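
The abstract does not say which statistical test is used, so the following is only a minimal sketch of one way term homogeneity could be checked to a significance level. It assumes a chi-square test over a 2 x N contingency table built from the N most frequent terms in two halves of a corpus. The function name `homogeneity_chi_square` and the parameter `top_n` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def homogeneity_chi_square(half_a, half_b, top_n=10):
    """Return (chi-square statistic, degrees of freedom) comparing the
    frequencies of the top_n most frequent terms in two token lists.

    Assumption: a chi-square test on two corpus halves stands in for the
    paper's homogeneity measure, which the abstract does not specify.
    """
    counts_a, counts_b = Counter(half_a), Counter(half_b)
    combined = counts_a + counts_b
    terms = [term for term, _ in combined.most_common(top_n)]

    # Row totals: occurrences of the selected terms in each half.
    row_a = sum(counts_a[t] for t in terms)
    row_b = sum(counts_b[t] for t in terms)
    if row_a == 0 or row_b == 0:
        raise ValueError("each half must contain at least one selected term")
    grand = row_a + row_b

    stat = 0.0
    for t in terms:
        col = counts_a[t] + counts_b[t]  # column total for this term
        for observed, row in ((counts_a[t], row_a), (counts_b[t], row_b)):
            expected = row * col / grand  # expected count under homogeneity
            stat += (observed - expected) ** 2 / expected
    return stat, len(terms) - 1


# Usage: tokenise two halves of a (toy) collection and compare them.
first_half = "the cat sat on the mat and the dog sat too".split()
second_half = "of the models of the corpus of the terms".split()
statistic, dof = homogeneity_chi_square(first_half, second_half, top_n=5)
print(statistic, dof)
```

Under these assumptions, a statistic that exceeds the chi-square critical value for the returned degrees of freedom at the chosen significance level would count as evidence against homogeneity. Restricting `top_n` to function words would give the function-word variant mentioned in the abstract.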

Publication date: 2004